require( lme4 )
Loading required package: lme4
Loading required package: Matrix
package ‘Matrix’ was built under R version 3.4.1
Attaching package: ‘Matrix’
The following object is masked from ‘package:tidyr’:
expand
This dataset covers the period of 2000–2017 and includes 62560 records of fines levied against corporations relating to violations of regulations, from cases initiated by 43 federal regulatory agencies. I downloaded the data from the Good Jobs First “Violation Tracker”, using the search GUI with all the options set to <any>.
The Size of Penalties
This dataset contains fields for both Penalty_Amount and Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting, but is clear that these adjustments make very little difference to the data as a whole, as these two variables are correlation at r=0.994. I have thus elected to use the adjusted penalty values for all these analyses.
The first pattern to note is that (perhaps predictably) the penalties imposed via criminal proceedings tend to be larger than those imposed via civil proceedings, and larger still when both civil and criminal proceedings have been brought.

The way this pattern is built up is somewhat non-intuitive, as this linear model and boxplot makes clear. The groups mean are well-separated, and easy to distinguish statistically (small p-values), but there is so much variation within groups that the predictive value of the mean differences is small (low R2 value). The vast majority of penalties arise from civil actions – 99.3% of those recorded – and the mean value of these penalties is comparatively small.
summary( lm( Penalty_Adj ~ Civ_Crim_bin, data=viol ) )
Call:
lm(formula = Penalty_Adj ~ Civ_Crim_bin, data = viol)
Residuals:
Min 1Q Median 3Q Max
-5.913e+08 -4.644e+06 -4.640e+06 -4.627e+06 2.080e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 596240136 24921654 23.93 <2e-16 ***
Civ_Crim_bincivil -591589764 24931682 -23.73 <2e-16 ***
Civ_Crim_bincriminal -464209918 26451988 -17.55 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 176200000 on 62557 degrees of freedom
Multiple R-squared: 0.01211, Adjusted R-squared: 0.01207
F-statistic: 383.3 on 2 and 62557 DF, p-value: < 2.2e-16
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Civ_Crim_bin, Penalty_Adj )) +
geom_point( colour="blue", alpha=0.3, position="jitter" ) +
geom_boxplot( outlier.size=0, alpha=0 ) +
coord_flip() +
xlab( "Type of case brought" ) +
scale_y_log10() +
ylab( "Penalty (log10 $'s)" )

Trends over Time
Modelling the relationship between penalty values and the year the penalty was imposed reveals an increasing trend, with the coefficent for year indicating an increase of ~$950k per year. However, the R2 value for this model indicates that it accounts for only 0.05% of the variation in the data.
This makes sense with reference to the scatter plot, where we can see that the trend for annual increase (represented by the dashed black line) is pretty modest when compared to huge variation in the size of penalties (note log10 scale).

Where are Violations Occurring?
Taking a broad-strokes view of the geographical data, we can see that there seems not to be a visually apparent trend in the locations where penalties are levied – this data appears to track pretty well with the locations of population centres, thought here may be more subtle patterns that are not visible at the nation-wide scale.
data( zipcode )
zipcode <- zipcode %>% mutate( zip=factor( zip ), region=substr( zip, 1, 1) )
full_join( viol, zipcode, by=c( "Zip" = "zip" ) ) %>% mutate( Zip=factor( Zip ) ) %>% filter( Civ_Crim_bin=="civil" ) %>%
ggplot() + geom_point( aes( x=longitude, y=latitude, col=Year ), cex=.5 ) +
theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL ) +
scale_y_continuous(limits = c(25,50), breaks = NULL ) +
labs(x=NULL, y=NULL)

However, if we look the number of penalties paid in each state/territory over the course of this 18-yr dataset, we can see that there are many more violations in some states that others, with West Virginia being responsible for 15.22% of all penalties levied.

The bars are colored by the industrial sector of the parent corporation found liable for the penalty – the blue that occupies 93.43% of the West Virginia bar represents corporations classified as “mining and minerals”. It is apparent that WV is unusual in both the number of violations and the number of those violations related to mining.
What are Corporations being Fined For?
Across the 49 industrial sectors represented, it is immediately clear from this barplot the extent to which mining & mineral corporations are over-represented (25.24% of all penalties). It is also clear that the pink areas – indicating ‘workplace safety or health’ violations – comprise the majority of violations in most sectors, but particularly so in mining.
viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Major_Industry_of_Parent, Major_Industry_of_Parent, function(x)-length(x) ) )) +
geom_bar( aes( fill=Primary_Offense )) +
coord_flip() +
theme( axis.text.y=element_text( hjust=0, size=6 ) ) +
ylab( "no. penalties" ) +
xlab( "" ) +
theme( legend.position="none" )




summary( lm( Penalty_Adj ~ Ownership_Structure, data=viol ) )
Call:
lm(formula = Penalty_Adj ~ Ownership_Structure, data = viol)
Residuals:
Min 1Q Median 3Q Max
-2.397e+07 -7.226e+06 -7.218e+06 -6.423e+06 2.079e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23974093 10234950 2.342 0.0192 *
Ownership_Structureprivately held -23300191 10350429 -2.251 0.0244 *
Ownership_Structurepublicly traded -16740801 10266208 -1.631 0.1030
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 177300000 on 62557 degrees of freedom
Multiple R-squared: 0.0002778, Adjusted R-squared: 0.0002458
F-statistic: 8.69 on 2 and 62557 DF, p-value: 0.0001684
summary( lm( Penalty_Adj ~ Facility_State, data=viol ) )
Call:
lm(formula = Penalty_Adj ~ Facility_State, data = viol)
Residuals:
Min 1Q Median 3Q Max
-3.370e+07 -1.822e+07 -2.563e+05 -8.214e+04 2.078e+10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18286776 1272342 14.373 < 2e-16 ***
Facility_StateAlabama -18025172 5189958 -3.473 0.000515 ***
Facility_StateAlaska -17819948 12040148 -1.480 0.138867
Facility_StateAmerican Samoa -17779976 125291812 -0.142 0.887153
Facility_StateArizona -18213383 7523135 -2.421 0.015481 *
Facility_StateArkansas -18157251 9940337 -1.827 0.067761 .
Facility_StateCalifornia -18120373 3613202 -5.015 5.32e-07 ***
Facility_StateColorado -18070030 5634389 -3.207 0.001341 **
Facility_StateConnecticut -18120035 8523108 -2.126 0.033508 *
Facility_StateDelaware -17472454 17763650 -0.984 0.325312
Facility_StateDistrict of Columbia -18256598 31347170 -0.582 0.560299
Facility_StateFlorida -17985468 6470401 -2.780 0.005443 **
Facility_StateGeorgia -18224910 6361712 -2.865 0.004174 **
Facility_StateGuam -18273209 49157427 -0.372 0.710096
Facility_StateHawaii -18222047 17018418 -1.071 0.284297
Facility_StateIdaho -16471900 14522550 -1.134 0.256703
Facility_StateIllinois -17996049 3831115 -4.697 2.64e-06 ***
Facility_StateIndiana -17475494 4916420 -3.555 0.000379 ***
Facility_StateIowa -18156113 9489436 -1.913 0.055715 .
Facility_StateKansas -18133620 9016436 -2.011 0.044311 *
Facility_StateKentucky -18241583 3200510 -5.700 1.21e-08 ***
Facility_StateLouisiana -17578982 8220109 -2.139 0.032477 *
Facility_StateMaine -18257097 15028420 -1.215 0.224432
Facility_StateMaryland -18034588 9165956 -1.968 0.049123 *
Facility_StateMassachusetts -17422929 7274225 -2.395 0.016616 *
Facility_StateMichigan -18154328 6821285 -2.661 0.007783 **
Facility_StateMinnesota -18236304 5908405 -3.087 0.002026 **
Facility_StateMississippi -18055166 10535728 -1.714 0.086587 .
Facility_StateMissouri -18089957 7654616 -2.363 0.018117 *
Facility_StateMontana -18055374 12500535 -1.444 0.148640
Facility_StateNebraska -18198953 11061670 -1.645 0.099928 .
Facility_StateNevada -18031237 5875831 -3.069 0.002151 **
Facility_StateNew Hampshire -18235103 16790237 -1.086 0.277459
Facility_StateNew Jersey -18140599 6309489 -2.875 0.004040 **
Facility_StateNew Mexico -18158949 10917648 -1.663 0.096264 .
Facility_StateNew York -18108973 5287261 -3.425 0.000615 ***
Facility_StateNorth Carolina -17502134 6779308 -2.582 0.009834 **
Facility_StateNorth Dakota -18244123 20498538 -0.890 0.373458
Facility_StateNorthern Mariana Islands -18265388 177184812 -0.103 0.917895
Facility_StateOhio -16541211 4725096 -3.501 0.000464 ***
Facility_StateOklahoma -18148139 9529064 -1.905 0.056849 .
Facility_StateOregon -18184211 14768926 -1.231 0.218235
Facility_StatePennsylvania -18108068 4366808 -4.147 3.38e-05 ***
Facility_StatePuerto Rico -18011010 11368026 -1.584 0.113118
Facility_StateRhode Island -17500532 20919620 -0.837 0.402843
Facility_StateSouth Carolina -18218457 10208322 -1.785 0.074320 .
Facility_StateSouth Dakota -8620063 26443099 -0.326 0.744437
Facility_StateTennessee -18180665 6470401 -2.810 0.004958 **
Facility_StateTexas -17907249 3610898 -4.959 7.10e-07 ***
Facility_StateUtah -18016764 8096628 -2.225 0.026070 *
Facility_StateVermont -18023624 45765432 -0.394 0.693711
Facility_StateVirgin Islands of the U.S. 15412764 38684828 0.398 0.690323
Facility_StateVirginia -17142463 5158794 -3.323 0.000891 ***
Facility_StateWashington -18194236 8358478 -2.177 0.029504 *
Facility_StateWest Virginia -18199133 2529098 -7.196 6.27e-13 ***
Facility_StateWisconsin -14772491 7555345 -1.955 0.050560 .
Facility_StateWyoming -18171775 9778005 -1.858 0.063112 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 177200000 on 62503 degrees of freedom
Multiple R-squared: 0.002203, Adjusted R-squared: 0.001309
F-statistic: 2.464 on 56 and 62503 DF, p-value: 7.378e-09
summary( lm( Penalty_Adj ~ Year, data=viol[ viol$Civ_Crim_bin=="civil", ] ) )
summary( lmer( Penalty_Adj ~ Year + Civ_Crim_bin + ( Year | Facility_State ), data=viol ) )
---
title: "Data Incubator Challenge - Question 3: Project pitch"
author: "Will Pitchers"
date: "Saturday, 28 October 2017"
output: html_notebook
---

```{r load packages}
require( tidyverse )
require( data.table )
require( maps )
require( zipcode )
require( lme4 )
```

This dataset covers the period of 2000--2017 and includes 62560 records of fines levied against corporations relating to violations of regulations, from cases initiated by 43 federal regulatory agencies. I downloaded the data from the Good Jobs First ["Violation Tracker"](https://www.goodjobsfirst.org/violation-tracker), using the search GUI with all the options set to `<any>`.


```{r read in data}
viol <- tbl_df( fread( "/Users/willpitchers/Documents/=Job_Applications_etc/Data_Incubator_2017/violation_tracker_export.csv" ))

names( viol ) <- gsub( " ", "_", names( viol ))

viol <- viol %>% mutate( Year=as.integer( Year ), Industry_code=factor( Industry_in_Record ), Civ_Crim=factor( `Civil/Criminal` ) ) %>%
                  mutate( HQ_State_of_Parent=factor( HQ_State_of_Parent ), HQ_Country_of_Parent=factor( HQ_Country_of_Parent ) ) %>%
                  mutate( Primary_Offense=factor( Primary_Offense ), Penalty_Amount=as.numeric( gsub( "[$,]", "", Penalty_Amount ) ) ) %>% 
                  mutate( Penalty_Adj=as.numeric( gsub( "[$,]", "", Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting )) ) %>% 
                  mutate( Subtraction_From_Penalty=as.numeric( "[$,]", "", Subtraction_From_Penalty ) ) %>% 
                  mutate( Agency=factor( Agency ), Secondary_Offense=factor( Secondary_Offense ), Ownership_Structure=factor( Ownership_Structure ) ) %>% 
                  mutate( Major_Industry_of_Parent=factor( Major_Industry_of_Parent ), Zip=factor( Zip ), Facility_State=factor( Facility_State ) )

viol <- viol %>% mutate( Civ_Crim_bin=factor( ifelse( grepl( "civil and criminal", `Civil/Criminal` )=="TRUE", "both", 
                                              ifelse( grepl( "civil", `Civil/Criminal` )=="TRUE", "civil", "criminal" ))))

# str( viol )
# summary( viol$Year )
viol
```

## The Size of Penalties

This dataset contains fields for both `Penalty_Amount` and `Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting`, but is clear that these adjustments make very little difference to the data as a whole, as these two variables are correlation at r=`r round( cor( viol$Penalty_Amount, viol$Penalty_Adj ), 3)`. I have thus elected to use the adjusted penalty values for all these analyses.

The first pattern to note is that (perhaps predictably) the penalties imposed via *criminal* proceedings tend to be larger than those imposed via *civil* proceedings, and larger still when both civil *and* criminal proceedings have been brought.

```{r penalty civ vs. crim}
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Penalty_Adj )) + 
          geom_density( aes( col=Civ_Crim_bin, fill=Civ_Crim_bin ), alpha=.5 ) + 
          scale_x_log10() + 
          xlab( "Penalty (log10 $'s)" ) + 
          scale_fill_discrete( name="Type of\ncase brought") + 
          scale_colour_discrete( name="Type of\ncase brought" )
```

The way this pattern is built up is somewhat non-intuitive, as this linear model and boxplot makes clear. The groups mean are well-separated, and easy to distinguish statistically (small p-values), but there is so much variation within groups that the predictive value of the mean differences is small (low R^2^ value). The vast majority of penalties arise from civil actions -- `r round(length(viol$Penalty_Adj[viol$Civ_Crim_bin=="civil"])/length(viol$Penalty_Adj)*100,1)`% of those recorded -- and the mean value of these penalties is comparatively small.

```{r no.s crim vs. civ, fig.keep='last'}
summary( lm( Penalty_Adj ~ Civ_Crim_bin, data=viol ) )

viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Civ_Crim_bin, Penalty_Adj )) + 
                                        geom_point( colour="blue", alpha=0.3, position="jitter" ) + 
                                        geom_boxplot( outlier.size=0, alpha=0 ) + 
                                        coord_flip() + 
                                        xlab( "Type of case brought" ) +
                                        scale_y_log10() +
                                        ylab( "Penalty (log10 $'s)" )
```

## Trends over Time

Modelling the relationship between penalty values and the year the penalty was imposed reveals an increasing trend, with the coefficent for year indicating an increase of ~$950k per year. However, the R^2^ value for this model indicates that it accounts for only 0.05% of the variation in the data. 

This makes sense with reference to the scatter plot, where we can see that the trend for annual increase (represented by the dashed black line) is pretty modest when compared to *huge* variation in the size of penalties (note log^10^ scale).

```{r penalty by year}
summary( lm( Penalty_Adj ~ Year, data=viol ) )

viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( x=Year, y=Penalty_Adj ) ) + 
                                        geom_jitter( aes( col=Civ_Crim_bin ), alpha=0.5, width=.2, height=0 ) +
                                        ylab( "Penalty amount (log10 $'s)" ) +
                                        geom_smooth( method="lm", col="black", lty=2 ) + 
                                        scale_colour_discrete( name="Type of\ncase brought" ) +
                                        scale_y_log10()
```

## Where are Violations Occurring?

Taking a broad-strokes view of the geographical data, we can see that there seems not to be a visually apparent trend in the locations where penalties are levied -- this data appears to track pretty well with the locations of population centres, thought here may be more subtle patterns that are not visible at the nation-wide scale.

```{r penalty map, warning=FALSE}
data( zipcode )
zipcode <- zipcode %>% mutate( zip=factor( zip ), region=substr( zip, 1, 1) )

full_join( viol, zipcode, by=c( "Zip" = "zip" ) ) %>% mutate( Zip=factor( Zip ) ) %>% filter( Civ_Crim_bin=="civil" ) %>% 
    ggplot() + geom_point( aes( x=longitude, y=latitude, col=Year ), cex=.5 ) + 
        theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL ) + 
        scale_y_continuous(limits = c(25,50), breaks = NULL ) + 
        labs(x=NULL, y=NULL)

```

However, if we look the number of penalties paid in each state/territory over the course of this 18-yr dataset, we can see that there are *many* more violations in some states that others, with West Virginia being responsible for `r round(nrow(viol[viol$Facility_State=="West Virginia",])/nrow(viol[viol$Facility_State!="",])*100,2)`% of all penalties levied. 

```{r}
viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Facility_State, Facility_State, function(x)-length(x) ) )) + 
                                              geom_bar( aes( fill=Major_Industry_of_Parent )) + 
                                              coord_flip() +
                                              theme( axis.text.y=element_text( hjust=0, size=6 ) ) +
                                              ylab( "no. penalties" ) + 
                                              xlab( "" ) + 
                                              theme( legend.position="none" )
```

The bars are colored by the industrial sector of the parent corporation found liable for the penalty -- the blue that occupies `r round( nrow(viol[viol$Facility_State=="West Virginia"&viol$Major_Industry_of_Parent=="mining and minerals",]) / nrow(viol[viol$Facility_State=="West Virginia",])*100,2)`% of the West Virginia bar represents corporations classified as "mining and minerals". It is apparent that WV is unusual in both the number of violations *and* the number of those violations related to mining.

## What are Corporations being Fined For?

Across the 49 industrial sectors represented, it is immediately clear from this barplot the extent to which mining & mineral corporations are over-represented (`r round( nrow(viol[viol$Major_Industry_of_Parent=="mining and minerals",]) / nrow(viol)*100,2)`% of all penalties). It is also clear that the pink areas -- indicating 'workplace safety or health' violations -- comprise the majority of violations in most sectors, but particularly so in mining.

```{r}
viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Major_Industry_of_Parent, Major_Industry_of_Parent, function(x)-length(x) ) )) + 
                                              geom_bar( aes( fill=Primary_Offense )) + 
                                              coord_flip() +
                                              theme( axis.text.y=element_text( hjust=0, size=6 ) ) +
                                              ylab( "no. penalties" ) + 
                                              xlab( "" ) + 
                                              theme( legend.position="none" )
```

```{r}
viol %>% filter( Facility_State=="West Virginia" ) %>% ggplot( aes( reorder( Major_Industry_of_Parent, Major_Industry_of_Parent, function(x)-length(x)) )) + geom_bar() + coord_flip()
```


```{r}
summary( viol$Major_Industry_of_Parent )
viol %>% ggplot( aes( reorder( Major_Industry_of_Parent, Major_Industry_of_Parent, function(x)-length(x)) )) + geom_bar() + coord_flip()
```

```{r}
summary( viol$Primary_Offense )
viol %>% ggplot( aes( reorder( Primary_Offense, Primary_Offense, function(x)-length(x)) )) + geom_bar() + coord_flip()
```
 

```{r}
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Penalty_Adj )) + geom_density( aes( col=Ownership_Structure, fill=Ownership_Structure ), alpha=.5 ) + scale_x_log10()

summary( lm( Penalty_Adj ~ Ownership_Structure, data=viol ) )
```


```{r}
viol %>% filter( Facility_State != "" ) %>% count( Primary_Offense, Facility_State ) %>% arrange( -n ) 
viol %>% filter( Facility_State != "" ) %>% count( Facility_State, Major_Industry_of_Parent ) %>% arrange( -n ) 

viol %>% filter( Facility_State != "" ) %>% count( Facility_State, Primary_Offense, Major_Industry_of_Parent ) %>% arrange( -n ) 

viol %>% filter( Facility_State != "" ) %>% count( Primary_Offense, Facility_State ) %>% arrange( -n ) %>% 
        ggplot( aes( Primary_Offense, Facility_State ) ) + geom_tile( aes( fill=n ), col="white" ) + 
          scale_fill_gradient( low = "white", high = "steelblue" ) + theme_minimal() + 
            theme( axis.text.x = element_text( angle=330, hjust=0, size=8 ), axis.text.y=element_text( hjust=0, size=6 ) )

viol %>% filter( Facility_State != "" ) %>% count( Major_Industry_of_Parent, Facility_State ) %>% arrange( -n ) 
```

```{r}
summary( lm( Penalty_Adj ~ Year  + Primary_Offense + Facility_State, data=viol ) )
```




```{r}
summary( lm( Penalty_Adj ~ Year, data=viol[ viol$Civ_Crim_bin=="civil", ] ) )


summary( lmer( Penalty_Adj ~ Year + Civ_Crim_bin + ( Year | Facility_State ), data=viol ) )
```
